fix: DeepSeek-OCR server crash, template routing, and CUDA OOM fallback #23394

Closed
soymh wants to merge 1 commit into ggml-org:master from soymh:feat/deepseek-ocr

soymh commented May 20, 2026

Upstream PR: #17400
Original implementation by sfallah (sf/deepseek-ocr branch):
https://github.com/sfallah/llama.cpp

Co-authored-by: sfallah <sfallah@users.noreply.github.com>

Fixes:

  • server crash: GGML_ASSERT(batch.n_tokens > 0) fired when mtmd image processing consumed all prompt tokens (inject a synthetic token before the assertion)
  • server crash: slot.task NULL dereference after release() on mtmd OOM
  • server crash: ggml_backend_sched_alloc_graph segfault on CUDA OOM (check the return value, matching sfallah's upstream guard pattern)
  • template routing: --chat-template deepseek-ocr was rendered as Jinja text instead of resolving to the built-in template (auto-detect + legacy fallback)
  • bitmap/marker mismatch: get_media_marker() returned a random string instead of mtmd_default_marker(), causing the tokenizer to split on the wrong marker
  • CUDA OOM: tensor loading falls back to the CPU backend when GPU allocation fails
  • mmproj GPU: -ngl 0 now also disables mmproj GPU (matches user intent)
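
To illustrate the bitmap/marker mismatch, here is a minimal standalone sketch. It assumes the tokenizer finds image positions by splitting the prompt on a marker string, as the bullet above describes. `split_on_marker` and `markers_match_bitmaps` are hypothetical helpers written for this example, not llama.cpp functions, and the marker strings in the usage note are placeholders.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of llama.cpp): split a prompt into text
// chunks around each occurrence of `marker`. N occurrences yield N + 1 chunks.
static std::vector<std::string> split_on_marker(const std::string & prompt,
                                                const std::string & marker) {
    assert(!marker.empty()); // an empty marker would never advance `pos`
    std::vector<std::string> chunks;
    std::size_t pos = 0;
    for (;;) {
        const std::size_t hit = prompt.find(marker, pos);
        if (hit == std::string::npos) {
            chunks.push_back(prompt.substr(pos)); // trailing text, may be empty
            return chunks;
        }
        chunks.push_back(prompt.substr(pos, hit - pos));
        pos = hit + marker.size();
    }
}

// Hypothetical check: the prompt is consistent only if it contains exactly
// one marker per bitmap supplied with the request.
static bool markers_match_bitmaps(const std::string & prompt,
                                  const std::string & marker,
                                  std::size_t n_bitmaps) {
    return split_on_marker(prompt, marker).size() == n_bitmaps + 1;
}
```

If the prompt was built with one marker but the tokenizer is handed a different (for example randomly generated) one, `markers_match_bitmaps(prompt, other_marker, 1)` returns false: the split finds no occurrences, so one bitmap is left without a position in the prompt.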

soymh requested review from a team as code owners May 20, 2026 08:18

ngxson (Contributor) commented May 20, 2026

These changes are too invasive and break all other models; we cannot accept them.

ngxson closed this May 20, 2026

ngxson (Contributor) commented May 20, 2026

ref: #23345
